Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Insert short overview of what the data contains and what we are trying to accomplish by building a model. Discuss our overall approach and what models might be appropriate.
Our team’s analysis seeks to build understanding of the ABC Beverage manufacturing process and the related factors that affect the pH of the company’s beverage products. We apply machine learning approaches, specifically a series of supervised learning algorithms, to company data to build and then select a predictive model of pH. This model could help the company adapt its processes in a changing regulatory environment.
Describe the size and the variables in the training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below.
The training data set contains 32 categorical, continuous, or discrete features and 2571 rows; a separate evaluation set of 267 rows lacks the target. That target is PH, which should be a continuous variable but has only 52 distinct values in the training set. As a result, possible predictive models could include regression, classification, or an ensemble of both.
There are two files provided:

- StudentData.xlsx: historical training data, which includes PH, the feature we seek to predict.
- StudentEvaluation.xlsx: a holdout evaluation set with the PH column left blank. Our model will have to be scored by an outside group with knowledge of the actual pH values.

Note: Both Excel files are in simple CSV format.
# Load the beverage manufacturing datasets
df <- read_excel('datasets/StudentData.xlsx')
df_eval <- read_excel('datasets/StudentEvaluation.xlsx')
# remove the empty PH column from the evaluation data
df_eval <- df_eval %>%
  dplyr::select(-PH)

Below is a list of the variables of interest in the data set:

- Brand Code: categorical; values A, B, C, D
- Carb Volume
- Fill Ounces
- PC Volume
- Carb Pressure
- Carb Temp
- PSC
- PSC Fill
- PSC CO2
- Mnf Flow
- Carb Pressure1
- Fill Pressure
- Hyd Pressure1
- Hyd Pressure2
- Hyd Pressure3
- Hyd Pressure4
- Filler Level
- Filler Speed
- Temperature
- Usage cont
- Carb Flow
- Density
- MFR
- Balling
- Pressure Vacuum
- PH: the TARGET we will try to predict
- Oxygen Filler
- Bowl Setpoint
- Pressure Setpoint
- Air Pressurer
- Alch Rel
- Carb Rel
- Balling Lvl

We compiled summary statistics on our dataset to better understand the data before modeling.
| Name | df |
|---|---|
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
First, numerous features contain missing data, coded as NA, that will need to be imputed. Note especially that 4 rows are missing a PH value; we will need to drop these rows as they cannot be used for training. Second, the basic histograms suggest that skewness is prevalent across features, e.g. PSC CO2 and MFR. And third, some of the skewed features appear to show near-zero variance, with a large number of 0 or even negative values, e.g. Hyd Pressure1 and Hyd Pressure2. In general, the skewness and imbalance may require transformation before modeling.
If we treat PH as a classification problem, we need to understand any class imbalance, as this may impact predicted classification. Our class balance is:
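As a quick illustrative sketch (not the report's actual code), a class balance for a continuous target can be tabulated by binning it; the toy pH values and cut points below are made up for demonstration, while the real check would bin `df$PH`:

```r
# Illustrative sketch: bin a continuous target into classes and tabulate
# the balance. With the real data, `ph` would be df$PH and the breaks
# would be chosen to match natural boundaries; these values are toy data.
ph <- c(7.9, 8.1, 8.4, 8.5, 8.5, 8.6, 8.6, 8.7, 9.0, 9.3)
ph_class <- cut(ph, breaks = c(-Inf, 8.3, 8.7, Inf),
                labels = c("low", "mid", "high"))
table(ph_class)
```

With a bell-shaped target, the middle bin dominates, which is exactly the imbalance a pure classifier would have to contend with.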
PH is normally distributed with possible outliers on the low and high ends. This distribution suggests a pure classification approach could be problematic as the predictions may favor pH values in the mid-range (where there are more data points). Natural boundaries may exist such that classification adds predictive information. However, given the normal shape, a regression or possible ensemble with regression and classification seems more appropriate.
Before continuing, let us better understand any patterns of missingness across predictor features.
| variable | n_miss | pct_miss |
|---|---|---|
| MFR | 212 | 8.2458187 |
| Brand Code | 120 | 4.6674446 |
| Filler Speed | 57 | 2.2170362 |
| PC Volume | 39 | 1.5169195 |
| PSC CO2 | 39 | 1.5169195 |
| Fill Ounces | 38 | 1.4780241 |
| PSC | 33 | 1.2835473 |
| Carb Pressure1 | 32 | 1.2446519 |
| Hyd Pressure4 | 30 | 1.1668611 |
| Carb Pressure | 27 | 1.0501750 |
| Carb Temp | 26 | 1.0112797 |
| PSC Fill | 23 | 0.8945935 |
| Fill Pressure | 22 | 0.8556982 |
| Filler Level | 20 | 0.7779074 |
| Hyd Pressure2 | 15 | 0.5834306 |
| Hyd Pressure3 | 15 | 0.5834306 |
| Temperature | 14 | 0.5445352 |
| Oxygen Filler | 12 | 0.4667445 |
| Pressure Setpoint | 12 | 0.4667445 |
| Hyd Pressure1 | 11 | 0.4278491 |
| Carb Volume | 10 | 0.3889537 |
| Carb Rel | 10 | 0.3889537 |
| Alch Rel | 9 | 0.3500583 |
| Usage cont | 5 | 0.1944769 |
| PH | 4 | 0.1555815 |
| Mnf Flow | 2 | 0.0777907 |
| Carb Flow | 2 | 0.0777907 |
| Bowl Setpoint | 2 | 0.0777907 |
| Density | 1 | 0.0388954 |
| Balling | 1 | 0.0388954 |
| Balling Lvl | 1 | 0.0388954 |
| Pressure Vacuum | 0 | 0.0000000 |
| Air Pressurer | 0 | 0.0000000 |
Notice that approximately 8.25 percent of the rows are missing a value for MFR. We may need to drop this feature considering that, as missingness increases, so do the potential negative consequences of imputation. Additionally, the categorical feature Brand Code is missing approximately 4.67 percent of its values. Since we do not know whether these values represent another brand or are actually missing, we will create a new feature category ‘Unknown’ consisting of missing values. The rest of the features are only missing small percentages of values, suggesting that KNN imputation should be safe.
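The missingness table above can be reproduced with a few lines of base R; the toy data frame here merely stands in for the real training set `df`, so the values are illustrative:

```r
# Count and rank missing values per column; `toy` stands in for df.
toy <- data.frame(MFR = c(700, NA, 710, NA),
                  `Brand Code` = c("A", NA, "B", "B"),
                  PH = c(8.5, 8.6, NA, 8.7),
                  check.names = FALSE)
n_miss <- colSums(is.na(toy))
pct_miss <- 100 * n_miss / nrow(toy)
sort(pct_miss, decreasing = TRUE)
```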
Next, we visualize the distributions of each of the predictor features. The visuals will help us select features for modeling, assess relationships between features and with PH, and identify outliers as well as transformations that might improve model resolution.
The distribution profiles show that skewness is prevalent: right skew in Oxygen Filler, PSC, and Temperature, and left skew in Filler Speed and MFR. These deviations from a normal distribution can be problematic for linear regression assumptions, so we might need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. Furthermore, we have a number of bimodal features, e.g. Air Pressurer, Balling, and Balling Lvl.
Bimodal features in a dataset are problematic but interesting, representing areas of potential opportunity and exploration. They suggest the existence of two different groups, or classes, within a given feature. These groups may have separate but overlapping distributions that could provide strong predictive signal in a model.
Were we tackling in-depth feature engineering in this analysis, we could leverage the mixtools package (see R Vignette). This package helps fit mixture models where data can be subdivided into subgroups. We could then add new binary features to indicate, for each instance, the distribution to which it belongs.
Here is a quick example showing a possible mix within Air Pressurer:
# Select `Air Pressurer` column and remove any missing data
df_mix <- df %>%
dplyr::select(`Air Pressurer`) %>%
tidyr::drop_na()
# Fit a two-component Gaussian mixture to `Air Pressurer`
air_pressure_mix <- normalmixEM(df_mix$`Air Pressurer`,
                                lambda = .5,
                                mu = c(140, 148),
                                sigma = 1,
                                maxit = 60)

## number of iterations= 7
# Simple plot to illustrate possible bimodal mix of groups
plot(air_pressure_mix,
whichplots = 2,
density = TRUE,
main2 = "`Air Pressurer` Possible Distributions",
     xlab2 = "Air Pressurer")

Lastly, several features have relatively normal distributions along with high numbers of values at an extreme. We have no information on whether these extreme values are mistakes, data errors, or otherwise inexplicable. As such, we will need to review each associated feature to determine whether to impute the values, leave them as is, or apply feature engineering.
We also elected to use boxplots to understand the spread of each feature.
The boxplots reveal outliers, though none of them seem egregious enough to warrant imputing or removal. Outliers should only be imputed or dropped if we have reason to believe they are errant or contain no critical information.
Next, we generate scatter plots of each predictor versus the target to get an idea of the relationship between them.
The scatter plots indicate some clear relationships between our target and predictor features, such as PH and Oxygen Filler or PH and Alch Rel. However, we also see clear correlations between some of the predictors, like Carb Temp and Carb Pressure. Overall, although our plots indicate some interesting relationships, they also underline the aforementioned possible issues with the data. For instance, many predictors have skewed distributions, and in some cases, missing data may be recorded as ‘0’.
We next quantify the relationships visualized above. In general, our model should focus on features showing stronger positive or negative correlations with PH. Features with correlations closer to zero will probably not provide any meaningful information on pH levels.
## values ind
## 1 0.361587534 Bowl Setpoint
## 2 0.352043962 Filler Level
## 3 0.233593699 Carb Flow
## 4 0.219735497 Pressure Vacuum
## 5 0.196051481 Carb Rel
## 6 0.166682228 Alch Rel
## 7 0.164485364 Oxygen Filler
## 8 0.109371168 Balling Lvl
## 9 0.098866734 PC Volume
## 10 0.095546936 Density
## 11 0.076700227 Balling
## 12 0.076213407 Carb Pressure
## 13 0.072132509 Carb Volume
## 14 0.032279368 Carb Temp
## 15 -0.007997231 Air Pressurer
## 16 -0.023809796 PSC Fill
## 17 -0.040882953 Filler Speed
## 18 -0.045196477 MFR
## 19 -0.047066423 Hyd Pressure1
## 20 -0.069873041 PSC
## 21 -0.085259857 PSC CO2
## 22 -0.118335902 Fill Ounces
## 23 -0.118764185 Carb Pressure1
## 24 -0.171434026 Hyd Pressure4
## 25 -0.182659650 Temperature
## 26 -0.222660048 Hyd Pressure2
## 27 -0.268101792 Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463 Fill Pressure
## 30 -0.357611993 Usage cont
## 31 -0.459231253 Mnf Flow
It appears that Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the highest correlations (positive) with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations with PH. The other features have a weak or slightly negative correlation, which implies they have less predictive power.
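The ranking above can be computed with `cor()` on the numeric columns, sorted by the PH column. This sketch uses synthetic predictors in place of the real ones, so the names `x1` and `x2` are illustrative only:

```r
# Rank predictors by their (pairwise-complete) correlation with the target.
set.seed(1)
dat <- data.frame(PH = rnorm(200))
dat$x1 <- dat$PH + rnorm(200, sd = 0.2)  # strongly related predictor
dat$x2 <- rnorm(200)                     # unrelated predictor
cors <- cor(dat, use = "pairwise.complete.obs")[, "PH"]
sort(cors[setdiff(names(cors), "PH")], decreasing = TRUE)
```

Using `use = "pairwise.complete.obs"` lets the calculation tolerate the NAs still present before imputation.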
One problem that can occur with multi-variable regression is correlation between predictor variables, called multicollinearity. A quick check is to run correlations between variables.
We can see that some variables are highly correlated with one another, such as Balling Lvl with Carb Volume, Carb Rel, Alch Rel, Density, and Balling (correlations between 0.75 and 1). When we start considering features for our models, we’ll need to account for the correlations between features and avoid including pairs with strong correlations.
As a note, this dataset is challenging as many of the predictive features go hand-in-hand with other features and multicollinearity will be a problem.
Lastly, we want to check for any features that show near-zero variance. Features that are the same across most of the instances will add little predictive information.
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31.11111 9.529366 FALSE TRUE
Hyd Pressure1 shows little variance - we will drop this feature.
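caret's `nearZeroVar` flags features like Hyd Pressure1 using the frequency ratio shown in the output above. A base-R sketch of that statistic follows; `freq_ratio` is our own helper for illustration, not a caret function, and the `hyd` values are made up:

```r
# Frequency ratio: count of the most common value over the count of the
# second most common. Large ratios (caret's default cutoff is 95/5 = 19)
# signal near-zero variance. `freq_ratio` is our own illustrative helper.
freq_ratio <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  if (length(tab) < 2) return(Inf)
  as.numeric(tab[1] / tab[2])
}
# A feature dominated by zeros, like Hyd Pressure1, scores high:
hyd <- c(rep(0, 95), 11.4, 20.2, 20.2, 33.0, 58.0)
freq_ratio(hyd)
```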
To summarize our data preparation and exploration, our findings fall into a few categories:

- MFR has more than 8% missing values - remove this feature.
- Hyd Pressure1 shows little variance - remove this feature.
- 4 rows are missing PH and need to be removed.
- Replace missing Brand Code values with “Unknown”.
- Impute remaining missing values with kNN() from the VIM package.

# drop rows with missing PH
df_clean <- df_clean %>%
filter(!is.na(PH))
# Change Brand Code missing to 'Unknown' in our training data
brand_code <- df_clean %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
df_clean$`Brand Code` <- brand_code$`Brand Code`
# Change Brand Code missing to 'Unknown' in our evaluation data
brand_code <- df_eval_clean %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
df_eval_clean$`Brand Code` <- brand_code$`Brand Code`
# There is an edge case where our Eval data might have a `Brand Code` not seen in our training set.
# If so, let's convert them to 'Unknown'. This is appropriate since any model trained without the
# new value wouldn't be able to glean any info from it.
codes <- unique(df_clean$`Brand Code`)
df_eval_clean <- df_eval_clean %>%
mutate(`Brand Code` = if_else(`Brand Code` %in% codes, `Brand Code`, 'Unknown'))
# Use the kNN imputing method from VIM package to impute missing values in our training data
df_clean <- df_clean %>%
kNN(k=10) %>%
dplyr::select(colnames(df_clean))
# Use the kNN imputing method from VIM package to impute missing values in our evaluation data
df_eval_clean <- df_eval_clean %>%
kNN(k=10) %>%
  dplyr::select(colnames(df_eval_clean))

No outliers were removed as all values seemed reasonable.
Brand Code is a categorical variable with values A, B, C, D and Unknown. For modeling, we will convert this to a set of dummy columns.
# -----
# Training data - Convert our `Brand Code` column into a set of dummy variables
df_clean_dummy <- dummyVars(PH ~ `Brand Code`, data = df_clean)
dummies <- predict(df_clean_dummy, df_clean)
# Get the dummy column names
dummy_cols <- sort(colnames(dummies))
# Make sure the new dummy columns are sorted in alpha order (to make sure our columns will match the eval dataset)
dummies <- as_tibble(dummies) %>%
  dplyr::select(all_of(dummy_cols))
# remove the original categorical feature
df_clean <- df_clean %>%
dplyr::select(-`Brand Code`)
# add the new dummy columns to our main training dataframe
df_clean <- cbind(dummies, df_clean)
# -----
# Evaluation data - Convert our `Brand Code` column into a set of dummy variables
# dummyVars' formula includes PH, so give the evaluation data a placeholder
# PH column before predicting the dummy columns
df_eval_clean$PH <- 1
eval_dummies <- predict(df_clean_dummy, df_eval_clean)
# Edge case - if the eval dataset doesn't have a specific `Brand Code`,
# we will be missing the necessary dummy column. Let's check and if necessary add
# appropriate dummy columns with all 0's.
for (c in dummy_cols) {
if (!(c %in% colnames(eval_dummies))) {
eval_dummies[c] <- 0
}
}
# Now sort the eval_dummy columns so they match the training set dummies
eval_dummy_cols <- sort(colnames(eval_dummies))
eval_dummies <- as_tibble(eval_dummies) %>%
  dplyr::select(all_of(eval_dummy_cols))
Finally, as our histogram plots showed during data exploration, some of our variables are highly skewed. To address this, we center, scale, and apply a Box-Cox transformation (using caret’s preProcess) to make them more normally distributed.
## Created from 2567 samples and 34 variables
##
## Pre-processing:
## - Box-Cox transformation (22)
## - centered (34)
## - ignored (0)
## - scaled (34)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -1.90000 -0.15000 -0.08182 1.15000 2.00000
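For reference, the Box-Cox transform that preProcess applies has a simple closed form. This hand-rolled sketch fixes lambda for illustration, whereas caret estimates a lambda per feature (the estimates are summarized above):

```r
# Box-Cox: y = (x^lambda - 1) / lambda, reducing to log(x) at lambda = 0.
# Requires strictly positive x; caret estimates lambda per feature, but
# here we fix it by hand to show the mechanics.
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
x <- c(1, 2, 4, 8, 16)   # right-skewed positive values
box_cox(x, 0)            # identical to log(x)
box_cox(x, 0.5)          # a square-root-like transform
```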
Here are some plots to demonstrate the changes in distributions before and after the transformations:
# Prepare data for ggplot
gather_df <- df_transformed %>%
dplyr::select(-c(PH)) %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)

As expected, the dummy variables, e.g. `Brand Code`A, are binary, and we still have bimodal features since we didn’t apply any feature engineering to them. A few still show skew, e.g. PSC Fill and Temperature, but they are closer to normal.
With our transformations complete, we can now continue on to building our models.
Using the training data, build at least three different models. Since we have multicollinearity, we should select models that tolerate feature correlations or perform feature selection.
Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.
With a solid understanding of our dataset at this point, and with our data cleaned, we can now start to build out candidate models. We will explore multiple linear regression, ridge regression, elastic net, a model-averaged neural network (avNNet), k-nearest neighbors (kNN), gradient boosting (GBM), multivariate adaptive regression splines (MARS), and Cubist.
First, we decided to split our cleaned dataset into a training and testing set (80% training, 20% testing). This was necessary as the provided holdout evaluation dataset doesn’t provide PH values so we cannot measure our model performance against that dataset.
set.seed(181)
# utilizing one dataset for all four models
training_set <- createDataPartition(df_transformed$PH, p=0.8, list=FALSE)
df_train <- df_transformed[training_set,]
df_test <- df_transformed[-training_set,]

Using our training dataset, we first ran a multiple linear regression model that included all features remaining after the data cleaning process described above, then applied stepwise selection (stepAIC) to keep only significant features.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# Model 1 - Build Multi-Linear Regression
model1_raw <- lm(PH ~ ., df_train)
# Build model 1 - only significant features (using stepAIC)
model1 <- stepAIC(model1_raw, direction = "both",
scope = list(upper = model1_raw, lower = ~ 1),
scale = 0, trace = FALSE)
stopCluster(cl)
# Display Model 1 Summary
(lmf_s <- summary(model1))

##
## Call:
## lm(formula = PH ~ `\`Brand Code\`A` + `\`Brand Code\`B` + `\`Brand Code\`C` +
## `Fill Ounces` + `PC Volume` + PSC + `PSC CO2` + `Mnf Flow` +
## `Carb Pressure1` + `Fill Pressure` + `Hyd Pressure2` + `Hyd Pressure3` +
## `Filler Level` + Temperature + `Usage cont` + `Carb Flow` +
## Density + Balling + `Pressure Vacuum` + `Oxygen Filler` +
## `Bowl Setpoint` + `Pressure Setpoint` + `Alch Rel` + `Balling Lvl`,
## data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50945 -0.07982 0.00982 0.08504 0.43065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.547150 0.002915 2932.148 < 2e-16 ***
## `\\`Brand Code\\`A` -0.016743 0.004227 -3.961 7.73e-05 ***
## `\\`Brand Code\\`B` 0.029783 0.006849 4.348 1.44e-05 ***
## `\\`Brand Code\\`C` -0.024726 0.004844 -5.104 3.63e-07 ***
## `Fill Ounces` -0.005894 0.003064 -1.924 0.054533 .
## `PC Volume` -0.008119 0.003577 -2.270 0.023313 *
## PSC -0.006766 0.003069 -2.205 0.027584 *
## `PSC CO2` -0.006481 0.002972 -2.180 0.029353 *
## `Mnf Flow` -0.086780 0.006344 -13.679 < 2e-16 ***
## `Carb Pressure1` 0.030781 0.003577 8.605 < 2e-16 ***
## `Fill Pressure` 0.011077 0.004300 2.576 0.010066 *
## `Hyd Pressure2` -0.023931 0.008542 -2.802 0.005131 **
## `Hyd Pressure3` 0.062535 0.010064 6.214 6.25e-10 ***
## `Filler Level` -0.016405 0.009345 -1.755 0.079332 .
## Temperature -0.017323 0.003320 -5.217 2.00e-07 ***
## `Usage cont` -0.019289 0.003768 -5.120 3.35e-07 ***
## `Carb Flow` 0.009855 0.003864 2.551 0.010824 *
## Density -0.020765 0.009086 -2.285 0.022401 *
## Balling -0.088497 0.014300 -6.189 7.32e-10 ***
## `Pressure Vacuum` -0.014525 0.004279 -3.394 0.000702 ***
## `Oxygen Filler` -0.011546 0.004079 -2.830 0.004694 **
## `Bowl Setpoint` 0.053075 0.009502 5.586 2.64e-08 ***
## `Pressure Setpoint` -0.017965 0.004387 -4.095 4.38e-05 ***
## `Alch Rel` 0.027232 0.011035 2.468 0.013678 *
## `Balling Lvl` 0.107707 0.016128 6.678 3.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1319 on 2030 degrees of freedom
## Multiple R-squared: 0.4153, Adjusted R-squared: 0.4084
## F-statistic: 60.08 on 24 and 2030 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 8.541432972 8.5528662976
## `\\`Brand Code\\`A` -0.025032457 -0.0084526361
## `\\`Brand Code\\`B` 0.016350785 0.0432148570
## `\\`Brand Code\\`C` -0.034225499 -0.0152255359
## `Fill Ounces` -0.011902406 0.0001147270
## `PC Volume` -0.015132935 -0.0011046087
## PSC -0.012783804 -0.0007475681
## `PSC CO2` -0.012309755 -0.0006513135
## `Mnf Flow` -0.099222383 -0.0743385016
## `Carb Pressure1` 0.023765704 0.0377968490
## `Fill Pressure` 0.002643829 0.0195097510
## `Hyd Pressure2` -0.040682619 -0.0071802953
## `Hyd Pressure3` 0.042798604 0.0822704157
## `Filler Level` -0.034731970 0.0019220412
## Temperature -0.023834539 -0.0108108559
## `Usage cont` -0.026677337 -0.0119001011
## `Carb Flow` 0.002277800 0.0174320499
## Density -0.038584542 -0.0029450834
## Balling -0.116541117 -0.0604526406
## `Pressure Vacuum` -0.022916789 -0.0061323473
## `Oxygen Filler` -0.019546262 -0.0035462859
## `Bowl Setpoint` 0.034440071 0.0717092181
## `Pressure Setpoint` -0.026567991 -0.0093622248
## `Alch Rel` 0.005590770 0.0488734326
## `Balling Lvl` 0.076077342 0.1393359230
# Display Variable feature importance plot
variableImportancePlot(model1, "Model 1 LM Variable Importance")

## [1] "VIF scores of predictors"
## `\\`Brand Code\\`A` `\\`Brand Code\\`B` `\\`Brand Code\\`C` `Fill Ounces`
## 2.112206 5.538897 2.716281 1.101626
## `PC Volume` PSC `PSC CO2` `Mnf Flow`
## 1.479643 1.103823 1.030926 4.760599
## `Carb Pressure1` `Fill Pressure` `Hyd Pressure2` `Hyd Pressure3`
## 1.525962 2.187856 8.628236 12.049196
## `Filler Level` Temperature `Usage cont` `Carb Flow`
## 10.325127 1.340368 1.690024 1.803029
## Density Balling `Pressure Vacuum` `Oxygen Filler`
## 9.766020 24.149939 2.122035 1.981706
## `Bowl Setpoint` `Pressure Setpoint` `Alch Rel` `Balling Lvl`
## 10.724204 2.281632 14.349801 30.617008
Several VIF scores exceed 10 (e.g. Balling at 24.1 and Balling Lvl at 30.6), confirming the multicollinearity flagged during data exploration. Applying Model 1 against our Test Data:
# Predict df_test and calculate performance
model1_pred <- predict(model1, df_test)
# Merge the results into a data frame called results
results <- data.frame()
results <- data.frame(t(postResample(pred = model1_pred, obs = df_test$PH))) %>%
  mutate(Model = "Multiple Regression") %>%
  rbind(results)

Model 1 explains roughly 42% of the variance in PH on the training data (adjusted R-squared 0.408). Many coefficients are statistically significant and directionally sensible, e.g. the negative Mnf Flow coefficient matches its strong negative correlation with PH, but the modest R-squared suggests a purely linear model may not capture the process well.
We skip a Partial Least Squares (PLS) model here and move to ridge regression, a penalized approach that is more tolerant of correlated predictors.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# to find the right lambda using cv.glmnet
x_train <- model.matrix(PH ~ ., data = df_train)
x_test <- model.matrix(PH ~ ., data = df_test)
cv.glmnet <- cv.glmnet(x_train, df_train$PH, alpha = 0)
ridge_model <- glmnet(x_train, df_train$PH, alpha = 0, lambda = cv.glmnet$lambda.min)
stopCluster(cl)

Applying Model 2 against our Test Data:

Elastic Net Regression
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# training the elastic net regression model using train
elas_net_model <- train(
PH ~ ., data = df_train, method = "glmnet",
trControl = trainControl("repeatedcv", repeats = 8),
tuneLength = 4
)
stopCluster(cl)

Applying Model 3 against our Test Data:
The elastic net performed on par with the other linear approaches (test-set RMSE 0.131, R-squared 0.450), again suggesting that a purely linear fit is limited for this data. Next, we fit a model-averaged neural network.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# using method avNNET from train, which is to aggregate several neural network models by averaging
nnetGrid <- expand.grid(.decay = c(0.1, 0.5), .size = c(1,10), .bag = FALSE)
nnet.model <- train(PH ~ ., data = df_train, method = "avNNet", preProcess = c("center",
"scale"), tuneGrid = nnetGrid, trControl = trainControl(method = "repeatedcv",
repeats = 1), trace = FALSE, linout = TRUE, maxit = 500)
nnet.model

## Model Averaged Neural Network
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 1850, 1850, 1848, 1850, 1849, 1850, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.1 1 0.1348875 0.3853322 0.10478518
## 0.1 10 0.1118050 0.5810853 0.08372700
## 0.5 1 0.1355328 0.3794013 0.10567083
## 0.5 10 0.1162188 0.5449561 0.08699266
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 10, decay = 0.1 and bag = FALSE.
Applying Model 4 against our Test Data:
The averaged neural network improved markedly over the linear models, with a best resampled RMSE of 0.112 and R-squared of 0.581 at size = 10 and decay = 0.1. Next, we fit a k-nearest neighbors model.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
knnModel <- train(PH ~ ., data = df_train, method = "knn", preProc = c("center","scale"), tuneLength = 10)
knnModel

## k-Nearest Neighbors
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1321244 0.4353097 0.09721426
## 7 0.1295880 0.4473265 0.09594943
## 9 0.1281461 0.4552398 0.09586276
## 11 0.1279108 0.4556613 0.09626631
## 13 0.1282175 0.4523853 0.09701364
## 15 0.1282884 0.4513195 0.09730159
## 17 0.1285513 0.4487817 0.09783543
## 19 0.1288389 0.4464627 0.09822312
## 21 0.1291738 0.4435842 0.09857563
## 23 0.1294574 0.4413430 0.09895038
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
Applying Model 5 against our Test Data:
Generalized Boosted Regression Modeling (GBM)
Gradient boosting builds an ensemble of shallow trees, each correcting the residuals of the previous ones; it handles correlated predictors and nonlinearities well. We tune the number of trees, interaction depth, shrinkage, and minimum node size over a grid.
set.seed(244)
df_transformed1 <- df_transformed %>% dplyr::select (-PH)
X.train <- df_transformed1[training_set, ]
y.train <- df_transformed$PH[training_set]
X.test <- df_transformed1[-training_set, ]
y.test <- df_transformed$PH[-training_set]
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
grid <- expand.grid(n.trees = c(50, 100, 150, 200), interaction.depth = c(1, 5, 10,
15), shrinkage = c(0.01, 0.1, 0.5), n.minobsinnode = c(5, 10, 15))
gbm_Model <- train(x = X.train, y = y.train, method = "gbm", tuneGrid = grid, verbose = FALSE # turn off the status of training process
)
stopCluster(cl)
plot(gbm_Model)

## n.trees interaction.depth shrinkage n.minobsinnode
## 92 200 15 0.1 10
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 34 predictors of which 34 had non-zero influence.
Applying Model 6 against our Test Data:
Multivariate Adaptive Regression Splines (MARS)
MARS builds piecewise-linear basis functions (hinges) that can capture nonlinear relationships; we tune the degree of interaction and the number of retained terms (nprune).
options(max.print = 1e+06)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
mars.model <- train(x = X.train, y = y.train, method = "earth", tuneGrid = mars.grid,
    preProcess = c("center", "scale"), tuneLength = 10)
## Multivariate Adaptive Regression Spline
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1515914 0.2164641 0.1188890
## 1 3 0.1450665 0.2827160 0.1140665
## 1 4 0.1429866 0.3033961 0.1117112
## 1 5 0.1417045 0.3161686 0.1105850
## 1 6 0.1401261 0.3314006 0.1092807
## 1 7 0.1391452 0.3411372 0.1083753
## 1 8 0.1378953 0.3530805 0.1070356
## 1 9 0.1364122 0.3666891 0.1057504
## 1 10 0.1351448 0.3786833 0.1047728
## 1 11 0.1342633 0.3869863 0.1040020
## 1 12 0.1334817 0.3940839 0.1035001
## 1 13 0.1329654 0.3988214 0.1029334
## 1 14 0.1357624 0.3897371 0.1029411
## 1 15 0.1357732 0.3908238 0.1027151
## 2 2 0.1516755 0.2155325 0.1188739
## 2 3 0.1457908 0.2757945 0.1146464
## 2 4 0.1485873 0.2872105 0.1126108
## 2 5 0.1470371 0.3029630 0.1107327
## 2 6 0.1457457 0.3172231 0.1096626
## 2 7 0.1457553 0.3266903 0.1087959
## 2 8 0.1440412 0.3461631 0.1065605
## 2 9 0.1415631 0.3559741 0.1055063
## 2 10 0.1396238 0.3682122 0.1041464
## 2 11 0.1388524 0.3774486 0.1031582
## 2 12 0.1376942 0.3861125 0.1023645
## 2 13 0.1368998 0.3924895 0.1013338
## 2 14 0.1362133 0.3997770 0.1005367
## 2 15 0.1353128 0.4026833 0.1001965
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 1.
Applying Model 7 against our Test Data:
Cubist is a rule-based model that fits linear models at the leaves of a tree and can smooth predictions with nearest-neighbor corrections; caret tunes the number of committees and neighbors.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(244)
cubist_Model <- train(x = X.train, y = y.train, method = "cubist")
cubist_Model

## Cubist
##
## 2055 samples
## 34 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1439406 0.4128685 0.09861123
## 1 5 0.1424728 0.4385004 0.09711965
## 1 9 0.1417079 0.4377051 0.09650108
## 10 0 0.1072778 0.6112037 0.07778919
## 10 5 0.1053312 0.6269784 0.07595580
## 10 9 0.1049245 0.6288182 0.07564614
## 20 0 0.1042166 0.6337074 0.07529430
## 20 5 0.1023700 0.6460943 0.07341209
## 20 9 0.1019174 0.6488303 0.07316633
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 9.
Applying Model 8 against our Test Data:
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Cubist Model | 0.0937689 | 0.7193900 | 0.06879833 |
| Multivariate Adaptive Regression Splines (MARS) | 0.1307506 | 0.4543483 | 0.10112677 |
| Generalized Boosted Regression Modeling (GBM) | 0.1000686 | 0.6789421 | 0.07496804 |
| k-Nearest Neighbors (kNN) | 0.1201111 | 0.5426207 | 0.09249645 |
| Neural Network (avNNet - Model Averaging) | 0.1073165 | 0.6313222 | 0.08149964 |
| ElasticNet Regression | 0.1313219 | 0.4500410 | 0.10275847 |
| Ridge Regression | 0.1319699 | 0.4491237 | 0.10331111 |
| Multiple Regression | 0.1318883 | 0.4439717 | 0.10280182 |
The Cubist model clearly outperformed the others on the held-out test set (RMSE 0.094, R-squared 0.719, MAE 0.069), with GBM second (RMSE 0.100) and the averaged neural network third. The linear approaches (multiple, ridge, and elastic net regression) trailed well behind, confirming that the relationship between the process variables and PH is substantially nonlinear. We therefore select the Cubist model as our final model.
For the model, will you use a metric such as log-likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
Because our final model predicts a continuous pH value, we evaluated candidates with regression metrics (RMSE, R-squared, and MAE) rather than classification metrics such as accuracy, AUC, or a confusion matrix.
We apply our final Cubist model to the holdout evaluation set to predict the targets for these instances. We have saved these predictions as CSV in the file eval_predictions.csv.
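A minimal sketch of the export step follows. `final_model` and the stand-in prediction values are placeholders, not the report's actual objects; in the real code the PH column would come from `predict(final_model, df_eval_clean)` and be written to eval_predictions.csv:

```r
# Hypothetical export sketch: in the real report the PH column comes from
# predict(final_model, df_eval_clean); stand-in values are used here, and
# we write to a temp directory rather than the working directory.
preds <- data.frame(PH = c(8.51, 8.63, 8.47))
out_file <- file.path(tempdir(), "eval_predictions.csv")
write.csv(preds, out_file, row.names = FALSE)
nrow(read.csv(out_file))
```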
Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/eval_predictions.csv